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Abstract: The Bradley- Terry model is a popular approach to describe probabilities of the possible 
outcomes when elements of a set are repeatedly compared with one another in pairs. It has found 
many applications including animal behaviour, chess ranking and multiclass classification. Numer- 
ous extensions of the basic model have also been proposed in the literature including models with 
ties, multiple compa risons, group comparisons and random graphs. From a computational point of 



view, 



Hunter! (j2004[ ) has proposed efficient iterative MM (minorization-maximization) algorithms 
to perform maximum likelihood estimation for these generalized Bradley- Terry models whereas 
Bayesian inference is typically performed using MCMC (Markov chain Monte Carlo) algorithms 
based on tailored Metropolis-Hastings (M-H) proposals. We show here that these MM algorithms 
can be reinterpreted as special instances of Expectation-Maximization (EM) algorithms associated 
to suitable sets of latent variables and propose some original extensions. These latent variables 
allow us to derive simple Gibbs samplers for Bayesian inference. We demonstrate experimentally 
the efficiency of these algorithms on a variety of applications. 

Key-words: Bradley-Terry model, data augmentation, EM algorithm, Gibbs sampler, maximum 
likelihood estimation, MCMC algorithms, MM algorithm, Plackett-Luce model 
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Inference bayesienne pour les modeles de Bradley- Terry 

generalises 



Resume : Le modele de Bradley- Terry est une approche populaire pour decrire les resultats 
possibles lorsque des elements d'un ensemble sont mis en comparaison par paire. II a trouve de 
nombreuses applications incluant le comportement animal, le classement de joueurs d'echecs et 
la classification multi-classes. Plusieurs extensions du modele classique ont ete proposees dans la 
litter ature arm de prendre en compte des matchs nuls, des co mparais o ns m ultiples et des com- 



paraisons entre groupes. D'un point de vue computationel, iHunterl (|2004l ) a propose des al- 
gorithmes MM (minorization-maximization) iteratifs efficaces pour l'estimation du maximum de 
vraisemblance dans les modeles de Bradley- Terry generalises, tandis que l'inference bayesienne est 
realisee a l'aide d'algorithmes MCMC (Markov chain Monte Carlo) basees sur des lois de proposi- 
tion Metropolis-Hastings (M-H) adaptees. Nous montrons que ces algorithmes MM peuvent etre 
reinterpretes comme des instances d'algorithmes EM (Expectation-Maximization) associes a des 
ensembles de variables latentes. Ces variables latentes nous permettent de deriver des algorithmes 
de Gibbs simples pour l'inference bayesiennes. Nous demontrons experimentalement 1'efficacite de 
ces algorithmes sur plusieurs applications. 

Mots-cles : Modele de Bradley- Terry, Algorithme EM, Echantillonneur de Gibbs, Estimation 
par maximum de vraisemblance, algorithmes MCMC, algorithme MM, Modele de Plackett-Luce 
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1 Introduction 



Consider a set of K elements. These elements are repea t edly compared with one another in pairs. 
For two elements i and j of this set, Bradley and Terry ( 19521 ) suggested the following model 



Pr(i beats j) 



Aj + Xj 



(1) 



, K} that represents its skill rating 



where A; > is a parameter associated to element I € {1,2, 
and we denote A := {Xi}f =1 . 

This model has found numerous applications. As mentioned in (jHunterl . 120041) . as early as 1976, 

a pub lished bibliography on paired comparisons includes several hundred entries ([Davidson and Farquhar , 
19761 ). For example, it has been adopted by the World Chess Federation and the European Go 
Federation to rank players a nd it is a standard app r oach to build multiclass classifiers based on the 
output of binary classifiers (iHastie and Tibshiranil . Il998l). Various ext e nsion s have been proposed 
to handle home advantage ( Agresti . 1990h. draws (Rao and Kupper , 19671 ) . multiple ( Plackett . 
19751 : lLuceUl959h and team comparisons (jHuang et al.L I200C). In par t icular, the popula r extension 



p opula r extension 
1 1 19591 ). defines a 



to multiple comparisons, named the Plackett-Luce model (jPlackettl . 119751 : iLuce 
prior distribution over permut ations and has been used fo r rank ing o f multiple indiv iduals and 
for choice models (ILucd . 119771 ). The monog raphs of bavidl (|l988l ) and biaconis] dl988l . Chap. 9) 
provide detailed discussions on the statistical foundations of these models. 

For the basic Bradley- Terry model ([T|), it is possible to find the maximum l ikelihood (ML ) 
estimate of t he s kill ratings A using a simple iterative procedure (jZermelol . Il929l : iHunterl . l2004l ). 
Lange et al. (I2000I ) established that this procedure is a specific case of the general class of algorithms 
referred to as MM algorithms. Generally speaking, MM algorithms use surrogate minimizing 
functions of the log-likelihood to define an iterative procedure converging to a local maximum. 
EM algorithms are thus just a special case of MM algorithms. A n excell e nt su rvey of the MM 
approach and its applications can be found in (lLange et all . l2000h . IHunterl (|2004l ) further derived 
MM algorithms for generalized Bradley- Terry models and established sufficient conditions under 
which these algorithms are guaranteed to converge towards the ML estimate. 

Recently s everal authors have proposed to perform Bayesian inference for (generalized) Bradley- 



■ayc 
)9; 



2009; IGoriir et al.1. 12006: Guiver and Snelsonl, 



Terry models (jAdamsl . 120051 : iGormlev and Murphy 

20091 ). The resulting posterior density is typically not tractable and need s to b e approximated. 
An Expectation-Propagation method is developed in (jGuiver and Snelsonl . 120091 ): this yields an 
approximation of the posterior which can be computed quickly and might be suitable for very 
large scale applications. However, it relies on a functional approximation of the posterior and 
the converge nce propertie s of this algorithm are not well- understood. M- H algorithms have been 
propo sed in (lAdamsl . 120051 : IGormlev and Murphvl . 120091 : IGoriir et all 120061 ) . IGormlev and Murphy! 
(|2009l ) suggested a carefully designed proposal distribution, though it can perform poorly in some 
scenarios as demonstrated in section [H 

Our contribution here is three- fold. F irst, we show that by introducing suitable sets of latent 
variables, the MM algorithms proposed by Hunter ( 20041 ) for the basic Bradley- Terry model and its 
generalizations to take into account home advantage, ties and multiple comparisons can be reinter- 
preted as standard EM algorithms. We believe that this non-trivial reinterpretation is potentially 
fruitful for statisticians who usually like thinking in terms of latent variables. Note that the latent 
variables introduced here differ f rom the ones in troduced in the standard Thurstonian interpreta- 



tion of the Bradley- Terry model (jDiaconisl . Il988l . Chap. 9) and lead to more efficient algorithms as 
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discussed in section [2j Second, using similar ideas, we propose original EM algorithms for some re- 
cent generalizations of the Bradley- Terry model including group comparisons and random graphs. 
Third, based on the sets of latent variables introduced to derive these EM algorithms, we propose 
Gibbs samplers to perform Bayesian inference in this important class of models. To the best of our 
knowledge, no Gibbs sampler has ever been proposed in this context. These algorithms have the 
great advantage of allowing us to bypass the design of proposal distributions for M-H updates and 
we demonstrate experimentally that they perform very well. 

The rest of this paper is organized as follows. In section [21 we consider the basic Bradley- 
Terry model (H|). Based on the introduction of a s uitable set of l atent variables, we present an EM 
reinterpretation of the MM algorithm presented by lHunter for Maximum a Posteriori (MAP) 

parameter estimation and an original data augmentation algorithm to sample from the posterior. 
In section El various standard extensions of the Bradley- Terry model allowing for home advantage, 
ties and competition between teams are described. EM algorithm s and o rigina l Gibbs sampling 
schemes are proposed. The Plackett-Luce model dPlackettJ . 1 197.4 iLuccL 1 19591 ). a very popular 
generalization of the Bradley- Terry model for multiple comparisons, is presented in section |H 
Algorithms applicable to further extensions of the Bradley- Terry model to choice models, random 
graphs and classification are presented in section [5j In section [61 these algorithms are applied to 
the NASCAR 2002 dataset and to chess competition data. 



2 Bradley- Terry model 

Suppose we have observed a number of statistically independent pairwise comparisons among K 
individuals. We denote by D the associated data. Let also Wij denote the number of comparisons 
where i beats j and n« = Wij + wu the total number of comparisons between i and j. Based on 
the Bradley- Terry model fll]), the log- likelihood function is given by 

l<i^j<K 

l<i¥=j<K l<i<j<K 

where the notation 1 < i ^ j < K is an abusive notation to denote the set £ {1, ...,K} 2 

such that i ^ j} and 1 < i < j < K stands for j) E {1, K} 2 such that i < j\. 

We seek to introduce latent variables which are such that the resulting complete log-likelihood 
admits a simple for m. It is well-kno wn that the Bradley- Terry model enjoys the following Thursto- 
nian interpretation ( Diaconisl . 19881 . Chap. 9): for each pair 1 < i < j ' < K and for each associated 



pair comparison k = l,...,rijj, let Y^i ~ £(K) and Y^j ~ £(Xj) where £ (<r) is the exponential 
distribution of rate parameter q then 

A,; 



Pr(Y kt <Y kj )-- J - TJ -. 

These latent variables can be interpreted as arrival times and the individual with the lowest arrival 
time wins. These latent variables would allow us to define EM and data augmentation algorithms. 
However, if we only introduce for each pair i, j the sum of the lowest arrival times 



Zij = ^2 mm(Y kj ,Y ki ) ~ Q{riij, \ + Xj) 



k=i 
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instead of {Y^i, Y^j} for k = 1, . . . , n%j, then the resulting complete log-likelihood remains simple. 
Here Q (a, i3) denote the Gamma distribution of shape a and inverse scale f3. As the fraction 
of missing information is reduced, this l eads to faster rates of convergence for the resulting EM 
and data augmentation algorithms (|Liul . l200ll . Chap. 6). We will essentially proceed similarly to 
introduce latent variables for generalized Bradley- Terry models. 

To summarize, for 1 < i < j < K such that > 0, we introduce the latent variables Z = {Zij} 
which are such that 

p(z\D,X)= Yl G(zij;riij,Xi + Xj) (2) 

l<i<j<K\nij>0 

The resulting complete data log-likelihood is given by 



t(\,z) 



£ 

l<i^j<K\wij>0 



Wij log Xi 



y~) (Aj + Xj)zij + (riij - 1) log z^ - log T(nij) (3) 

l<i<j<K\riij>0 



where T is the Gamma function. 

If we assign additionally a prior to A such that 

K 

p(X) = Hg(Xi-a,b) (4) 

i=i 

as in ( Gormlev and Murphy , 2009 ; Guiver and Snelson , 20091 ) then we can maximize the resulting 
log-posterior using the EM algorithm which proceeds as follows at iteration t: 



arg max 

x 



We have 



Q(X, A*) = E Z | AA , [£(X, Z)\ + log p (A) 
■1 + Wi) logAj - bXi ■ 



(5) 



(6) 



K 

£ 

i=l 



E 

l<i<j<K 



(Xi + Xj) — 



71: 



X* + Xj 



with "=" meaning "equal up to terms independent of the first argument of the Q function" and 



W: 



K 



Wij is the number of wins of element i. Using ([5]), it follows that 



A 



(t) 



1 + Wi 



b + T,j 1 Li At-1 



(7) 



For a = 1 and 6 = 0, the MAP and ML estim ates coincide. In this case ([6]) is exactly the minorizing 
function of the MM algorithm proposed in (jHunterl . |2004| . Eq. (10)) and thus the MM algorithm 
is given by (|7|). 

Based on the same latent variables, we present a simple data augmentation algorithm for 
sampling from the posterior distribution p (A, z\ D) oc p (A) exp (^(A, z)). By construction, we can 
update Z conditional upon A using ([2]) and the conditional for A given Z can be expressed easily 
so that the data augmentation sampler at iteration t proceeds as follows: 



For 1 < % < j < K s.t. riij > 0; sample 



z if I A ~ G( n iji K 
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• For i = 1, . . . , K, sample 

^\D,Z(^g(a + Wl ,b + E ^+ E Z fh 

i<j\riij>0 i>j\iiij>0 



3 Generalized Bradley- Terry models 
3.1 Home advantage 

Consider now that the pairwise comp arisons are modeled using the Bradley- Terry model with 
"home- field advantage" ( Agrestil . Il99d ) where 



Pr(i beats j) 



exJ+Aj if* is home, 



if j is home. 



The parameter 9, 9 > 0, measures the strength of the home-field advantage (9 > 1) or disadvantage 
(9 < 1). Let otjj be the number of times that i is at home and beats j and bij is the number of 
times that i is at home and loses to j. 

The log-likelihood of the skill ratings A and 9 is given by 

K 

£(X, 9) = clog 9 + ^2 Wi log Xi- E n ij log{9\ i + \j) 

i=l l<i^j<K 

where = + is the number of times i plays at home against j, c = Yli<i^j<K a n ^ s the 
total number of home-field wins and Wi is the total number of wins of element i. 

For 1 < i ^ j < K such that riij > 0, let us introduce the latent variables Z = {Z{j} which are 
such that 

p(z\D,\)= 11 Qi-.j^nj.OXi ■ Xj ) . (8) 

i<i^j<K\ nij >o 

The associated complete data log-likelihood is given by 



K 

£(X,9,z) = clog9 + J2 w i l °sK- E (#Aj + Xj)zij + logF 

i=l l<i^j<K\nij>0 



rkj) 



Using independent priors for A and 9, i.e. p (A, 9) = p (A) p (9), where p (A) is defined as @ and 

9~g(a ,b e ), (9) 
then we can maximize the resulting posterior using the EM algorithm 

(\®,6®) = argmaxQ((A,0),fA( t - 1 ),^- 1 ) N )) 

V ' (A,fl) V ' 

where we have 

Q((A, 9) , (X*,9*)) = m z{DX>e * [£(X, 9, Z)\ + log p (A, 9) 

K K 9Xi + X- 

= (a 6 -l + c)log9-b 6 9 + ^2(a-l + Wi)logXi-b^2Xi- E n ^ a*\* i \* 

i=l i=l l<i^j<K * i 



INRIA 



Bayesian Inference for Bradley- Terry Models 



7 



We obtain 



At) 



9 (t) 



a — 1 + Wi 



<i^j<K ) e( t-i) A (t-i) +A (t-i) + e ( t -i) A (t-i) +A (*-i) 
— 1 + c 



for i = 1, . . . , K, 



be + E 



v(*) 



For a = a# = 1 and b = bp = , i.e. i f we use flat priors, this EM algorithm is similar to the MM 
algorithm proposed in (jHunterl . 12004 . pp. 389). 

Using the same latent variables, we can sample from the posterior distribution of (A, 9, Z) using 
the Gibbs sampler which updates iteratively Z, A and 9 as follows at iteration t: 



For 1 < i ^ j < K s.t. nij > 0, sample 



Z^\D 



For i = 1, . . . , K, sample 



(*) 

ji 



j^i\mj>0 



j^i\mj>0 



Sample 



K 



9®\D,\®,Z®~g\a e + c,b e + 'E l )i i > £ Z® 

=1 jjLi\mj>0 



3.2 Model with ties 

If we now want to al l ow fo r ties in pairwise comparisons, we can use the following model proposed 
bv iRao and Kupperl (|l967l ) 



Pr(i beats j) 
Pr(i ties j) 



A; 



Xi + 9Xj ' 

(9 2 - T)XiXj 



(X i + 9X j )(9X i + X j ) 
where 9 > 1. The log-likelihood function for (X,9) is given by 



£(X,9)= w v lo Z 



A; 



+ ^lo. 



(9 2 - l)XiXj 



l<iytj<K 

l<i^j<K 



Xi + 9Xj 2 & (eXi + Xj^Xi + eXj) 

X + %l O g(0 2 -l) 



Aj + 9Xj 2 



where fy = tji is the number of ties between i and j and Sjj = Wij + . 
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For 1 < i 7^ j < K such that Sij > 0, let us introduce the latent variables Z = {Zij} which are 
such that 

p(z\D,X)= ]f G(zij;sij,Xi + e\j) 

l<iy£j<K\ Sij >0 

which yields the following complete log-likelihood 
£(X, 9, z) = T\og{9 2 - 1) + Y s v 1o S A * " ( A * + ex i) z ij + ( s ij ~ 1) lo S % - logr(s ij ) 

l<ift<K\sij>Q 

where T = ^ Si<i^y<if ^ s ^ e total number of ties. If we adopt for a flat improper prior on 
[1, oo) and we select p (A) as (jH) then we obtain 
Q((A, 0) , (A*,0*)) = E^n a*,0* [*(A, (9, Z)] + log p (A, 9) 



= Tlog(# 2 -l) + Y, s v lo S A i 



A; + 6>A 



A' 



l<j^j<A" 



A* + 6>*A^ 



^ U^^-lJlogAi-bAi 



i=l 



and we recover once again the minorizing function in (|Hunterl . |2004| . pp. 389-390) for a = 1 and 
6 = 0. This can be maximized using the following procedure 



• For i = 1, . . . , K, set 



A 



5*7 



Set 



where 



0« 



2c« 



4c(*) 2 



,(*) 



E 



l<i^j<K 



X?-V+9(t-Vxf-V 
1 j 



Using the same latent variables, we can sample from the posterior distribution of (A, 9, Z) using 
the following Gibbs sampler which updates iteratively Z, X and 9 as follows at iteration t: 

• For 1 < i ^ j < K s.t. Sij > 0, sample 

z%\d,x^,9^ ~ G (s^xf^ + e^x^) . 

• For i = 1, . . . , K, sample 



xf\DM l - l \Z^ ~G\a + Y^b+ Y zjf + 9^ Y Z j 



Y E 
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• Sample 

9®\D,\®,zW~ p (e\D,\®,Z®) 

where 

p(8\D, Z, A) oc (9 2 - 1) T exp f - ^ Zij 6 l e>1 . (10) 

\ l<i^j<K\ Sij >0 J 

It is possible to sample from (|10p exactly. By performing a change of variable 9 = 9 — 1, we 
obtain 

p(0|£>, Z, A) a (# 2 + 29) T exp I - ^ 

\ l<i¥=j<K\sij>0 

which is a mixture of Gamma distributions. 
3.3 Group comparisons 

Consider now that we have n pairwise comparisons betweem teams. For each comparison i = 
1, . . . , n, let 
C {1, . . . , K} be the win ning team and T~ C {1, . . . , K} the losing team where n T-~ = 
and Ti = T+ U Tf. Recently Euang et al.l <|200fih have proposed the following model 



Pr(^+ beats Tr) = ^H^L. (H) 
The log-likelihood function for A is thus given by 

i=l \ j& T+ ) \ieT< 

For i = 1, ...,n we introduce the latent variables Z = {Zi} and C = {C{\ such that 

p(z,c\D,X) =p(z\D,X)P(c\D,X) 

with 



p(z\D,X)=f[S \z i ;J2 X 



3 ' 



i=l \ jGTi 



Pr (c| £> A) = 1[_ Ac ' , with a G T + 



i= i^jeT+ 



A, 



where 8 (x; a) is the exponential density of argument x and parameter a. It follows that the 
complete log-likelihood is given by 

^(A,z,c) = f>gA Ci - I^TA, 
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The Q function associated to the EM algorithm is given by 
Q(A, A*) = E ZiC | AA * [£(A, Z, C)\ + log p (A) 

n \* \ K 

X j _ >'J- I. A 

n 

Oiik 



E E lo ^'y-^ - +E(° - !) **** - bX * 

i=l jeT + l^k(LT+ A k Z^jeTi A j k=1 

k=l \ i=l ^j&t A j J \ i= i l^j&T, A j J 



where = 1 if k £ and otherwise and 7^ = 1 if k G Tj and otherwise. It follows that the 
EM update is given by 

a ~ 1 + K E£=i 



A 



(t) 



7ifc 



v(«-l) 



Using the same latent variables, we obtain a data augmentation sampler to sample from 
p (A, z, c| -D) by iteratively sampling (Z, C) and A. This proceeds as follows at iteration t: 



For i = 1, n, sample 



Pr C 



k\D,X^ 



x 



(t-i) 



(t„i) , k G ^ . 



• For fe = 1, . . . , K , sample 

(n n N 

a +E<W 5 ' 6+ E^f 

where 5 UtV = 1 if u = v and otherwise. 

4 Multiple comparisons 



i=l 



8=1 



We now consider the popular Plackett-Luce model (|LuceL Il959l : IPlackettl . Il975l ) which is an exten- 
sion of the Bradley- Terry model to comparisons involving more than two elements. Assume that 
Pi < K individuals are ranked for comparison i where i = 1, ...,n. We write pi = (pa, ■ ■ ■ ,Pi Pi ) 
where pn is the first individual, pi2, the second, etc. The Plackett-Luce model assumes 



Pi-i 



p*tew = n = n vi^v- 

, =1 Z^k=j ^Ptk j =1 Z-/k=j A Pik 



(12) 



For i = 1, . . . , n and j = 1, . . . ,Pi — 1, we introduce the following latent variables Z = {Zij} 

n Pi-1 Pi 

p(z\D,X) = lHlS(z ir ,Yx p J 

i=l j=l k=j 
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which leads to the complete log-likelihood 



n Pi-l / Pi 

i=l j=l \k=j 

The Q function associated to the EM algorithm is given by 
Q(A,A*)=E Z | D)A , [£(X,Z)]+log p(A) 

n Pi-l sr^Pi \ K 

Z^k=j A Pik 



i=l j=l Z^fc=j p ifc fc=1 



log A fc - 6A fc 



which is once again equivalent to the majorizing function in (jHunterl . I2004J . pp. 398) for a = 1, 
6 = 0. It follows that the EM algorithm is given at iteration t by 

-i 



Xf = (a-l + w k ) 



n I Pi-l 



ijk 



i=l 



i=i EjfcL,-A. 



where Wk is the number of rankings where the fc th individual is not in the last ranking position and 
Sijk is defined as 

_ J 1 if k € {pij, . . .,p ipj } 



otherwise 



th 



i.e. Sijk is the indicator of the event that individual k receives a rank no better than j in the i 
ranking. 

To sample from p(X,z\ D), we can use the following data augmentation sampler. At iteration 
t, this proceeds as follows: 

• For i = l, ...,n, for j = 1, . . . ,pt — 1, sample 

k=j 



For k = 1, 



, -ftT, sample 



i=l j=l 



(t)> 



Using exactly the same augmentation, EM and Gibbs sample rs can be defined for furthe r 
extensions of these models such as mixtures of Plackett-Luce models ( Gormley and Murphy . 20081 ). 



5 Discussion 
5.1 Identifiability 

Consider the basic Bradley- Terry model and its extensions to group comparisons and multiple 
comparisons. Let us define 



K 



X; 



A = Y^ A« and = ~^ 



i=l 
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and write tt := {7Tj} i=1 . The likelihood is invariant to a rescaling of the vector A so the parameter 
A is not likelihood-identifiable and 

p(ir,A\D) =p(7r\D)p(A). 

From dU, it follows that tt ~ T>(a, . . . , a) where T> is the Dirichlet distribution and A ~ Q(Ka, b), 
hence 

kMAP _ aK ~ 1 
b 

To improve the mixing of the MCMC algorithms in this context, an additional sampling step can 
be added where we normalize the current parameter estimate A^ and then rescale them randomly 
using a prior draw for A. 

• For i = 1, . . . , K, set A* W = (t) A<*> where AW ~ g(Ka, b). 

E J= i \ 

This step can drastically improve the mixing of the Markov chain. However, if we are only 
interested in the normalized values tt of A then this additional step is useless. 

As an alternative, it is also possible to consider an EM algorithm for the basic Bradley- Terry 
model which does not require the introduction of a scale parameter. Assume tt ~ T>(a, . . . , a) and 
let us introduce latent variables My, Cy = (Cyi, . . . , CyAfy) for 1 < i ^ j < K such that rey > 

My ~ NB{nij,-Ki + ttj), 
Pr (C ijk = I) = „ 111 for k = 1, . . . My and I ^ i, j 

where NB(r,p) is the negative binomial distribution. The complete log-likelihood is given by 

K K K K r / n .- + m---l\ 

£(vr,m,c) = E E wylogTTj + E E E log I %3 J 7 j + ry fe log 7r fe 

* =1 3=1 j¥* *=1 j=lj^i k^i,j L y 

where ry^ is the number of cyj, Z = 1, . . . , my that take value k. Omitting the terms independent 
of 7r, the Q function is given by 

Q(7T,7T*) =E M \D,-x* [®C\D,M,n* [t(*,M,C)]] +log p(7T) 



E 



M\D,TT* 



i=l ,=i, i=l i=l.i&k=M.i \ V / ^tyhi 1 



j=l j=l,j^i i=l j=l,j=£i k^i,j 

+ log p (vr) 

K K K K * K 

E E E <%log7T; + E E E "*J ^ + ^ + ( Q ~ E 1 "^ + C 

i=l j=l,j^i i=l j=X,j^=ik^i,j 1 3 i=l 

K K K K 

= EK + < E E ^-rlogvr fc ) + (a-l)E 1 og^ + C' 

fc=l i=l,i^fc j=lj=£i : k 1 3 i=l 

where C is a term independent of 7r. It follows that the EM update is given by 

if if 

(*) i , (t-l) n u 
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with X/fe=i = ^" Although the above EM algorithm does not rely on unidentifiable scaling 
parameters, it suffers from a slow convergence rate. When 7Tfc takes small values, Yjtyk Ylj^k IF^f 
is large and it slows down the convergence of the EM algorithm. The same augmentation can be 
used to define a Gibbs sampler, but the same slow mixing issues arise for the Markov chain. 



5.2 Hyperparameter estimation 

The prior (j4]) is specified by the hyperparameters a and b. However, the inverse scale parameter 
b is not likelihood identifiable so there is no point assigning a prior to it. However it might be 
interesting to set a prior p(a) on a and estimate it from the data. Given A, we have 



P 



(a\\)cxp(a)(b K H\)j T- K (a). 

v v ' '2(a) 



h(a) 



It is possible to sample from thi s density using auxiliary variables U\,U 2 defined on (0, oo) as 
described in (jPamien et all 1 19991 ). We introduce 



p(a,ui,u 2 \ X) <xp(a)I{ui < l\ (a)} l{u 2 < h (a)} ■ 

A Gibbs sampler can now be implemented to sample from p (a, u%, u 2 \ A). We can directly sample 
from the full conditionals of U\ and U 2 

Ui\\~U(0,h(a)), U 2 \\~U(0,l 2 (a)) 

where IA (a, f3) is the uniform distribution on (a, The full conditional of a given u\, u 2 is given 
by 

p(a\X,u 1 ,u 2 ) (xp(a)I Al nA 2 (a) 



where 



Ai = {a: k (a) > Ui} . 



Alternatively we can update a using a M-H random walk on log(a). We can propose a* = exp(cr^z)a 
where z ~ 7V(0, 1) and a* is accepted with probability 



min < 1 



p(a*) / r(a) 



p(a) \T{a k ) 



K 



K 



i=i 



5.3 Further extensions 

5.3.1 Random graphs with a given degree sequence 

A model cl osely related to Bradley- Terry has been proposed for u ndirected random graphs with 
K vertices ( Holland and Leinhardt . 1981 ; Chatter iee et al. . 2O10l : Park and Newman . 2004 ). In 
this model, the degree sequence (d±, . . . ,dx) of a given graph, where di is the degree of node i, 
is supposed to capture all the information in the graph. It can be formalized b y saying that the 



degre e sequence is a sufficient statistic for a probability distribution on graphs (IChatteriee et al 
20ld ). 
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In this model an edge is inserted between vertices i and j for 1 < i < j < K with probability 



1 + XiXj 



where A& > for k € {1, . . . , K}. Let = 1 if there is an edge between i and j and otherwise. 
Given the observations D = {t^/Ik^ <r-> the log-likelihood function for A is given by 

l<i<j<K 

We introduce the following latent variables Z = {Zij} l<i< j <K such that 

p(z\D,X)= H £(z ij ;X i + —). 



l<i<j<K 



The complete log-likelihood is given by 

£(X,z)= rijlogXi- (1 -rij)logXj - (Xi + —)zij 

The Q function associated to the EM algorithm is given by 
Q(A,A*)=E Z | AA * [l{\Z)]+log p(X) 



K 



i=l 



i) + E^-E( 1 

j>i j<i 



Solving dQ(X, X*)/dXi = requires solving a quadratic equation. For sake of brevity, we do not 
present these details here. 

Once again, we can define a data augmentation sampler to sample from p( X,z\D) by iteratively 
sampling Z and A. This proceeds as follows at iteration t: 



For 1 < i < j < K , sample 



(f) 



D, A ( ' _1) ~ £(xf~^ + — ^ 



For i = 1, K, sample 



X]\D, Z® ~ QTQ \ 2(£ Z$ + b), 2J2 Z$\ a + £ r %] - £(1 - nj ) ] . 

j>i j<i j>i j<i 



Here QTQ (a, f3, 7) denotes the generalized inverse Gaussian distribution (see e.g. (jBarndorff-Nielsen and Sheph 

200 lh ) whose density for an argument x is proportional to 



x^- 1 exp{-(ax + (3/x) /2} 



Algorithms to sample exactly from this distribution are available. 
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5.3.2 Choice models 



Oth er extensions of the Bradley- Terry model are the choice models introduced by iRestle! (I196lh 
and Tverskv ( 1972al lbl) in psychology; see also ( Wickelmaier and Shmidt , 2004 ; Goriir et all 2006 ). 
In these models, we are given a set of n elements. To each element i is associated a set of K 
features represented by a binary vector /, 6 {0, 1} K . The probability that element i is chosen over 
element j is given by 



Ylk=i ^kfik(l - fjk) 



Ef=i WifcU - fjk) + Efc=i WjJfeC 1 ~ fik) 



K 



where A& > is a weight representing the importance of feature k. The term Ylk=i ^kfikiX ~ fjk) 
corresponds to the sum of the weights of features possessed by object i but not object j. EM and 
Gibbs algorithms can be derived by following the same construction as with group comparisons. 



5.3.3 Classification model 

Let consider the following original model for categorical data analysis 

Pr(y, = k)= ^U eMXl])Xk3 
J2i=iE P j=i ex p(Xij)Xij 

= yEffl^ (13) 

where Xi G W is a vector of covariates and A& £ for k = l,... , K. This model could be 
used as an alternative to the multinomial logit model ( Agresti . 199dl ). By introducing latent 

variables Zi ~ 8 ( exp(X,) T A; j , we can define EM and Gibbs algorithms resp. to maximize 



the posterior distribution of the parameters and sample from it when the prior is given by 

6 Experimental results 

In all the above models, the parameter b is just a scaling parameter on A^. As the likelihood is 
invariant to a rescaling of the A&, this parameter does not have any influence on inference. Hence 
to ensure that the MAP estimate satisfies ^fc^=i X k = 1) we se ^ b = Ka — 1 henceforth as explained 
in section [5j We demonstrate our algorithms on one synthetic and two real-world data sets. 

6.1 Synthetic Data 

We first study the Plackett-Luce model, comparing experimentally the mixing prop erties of the 
Gibbs sampler relative to a slightly modified version of the M-H algorithm proposed by 



Gormlev and Murphy 



(2009). In this latter paper, the authors propose to update the skill parameters simultaneously 



using the following proposal distribution^ at iteration t 

I 

for i = 1, . . . , K, \\ ~ g \a + w k , b + 



>ijk 



»=1 \i=l ^k=j A Pik 



lr The authors actually use a normal approximation of the gamma distribution, and work with normalized data. 
To obtain similar algorithms, we consider unnormalized data. 
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We simulated 500 dataset of n rankings of K = 4 individuals, for various values of n with a = 5. 
For each dataset, 10,000 iterations of the Gibbs sampler presented in section 2] were run. The 
sample lag-1 autocorrelation was then computed for the four skill parameters. For a given sample 
size n, the mean value over skill parameters an d simulated data is reported on Figure [T] together 
with 90% confidence bounds. The algorithm of Gormley and Murphy (j2009l ) performs reasonably 
well when the sample size is large, which is the case for the voting data they considered, but poorly 
for small sample sizes. 




Figure 1: Sample lag-1 autocorrelation as a function of the sample size n for the Gibbs sampler 
and a modified version of the M-H algorithm of Gormley and Murphy (2009). 



6.2 Nascar 2002 dataset 

NASCAR is the primary sanctioning body for stock car auto racing in the United States. Each 
race involves 43 drivers. During the 2002 season, 87 different drivers participated in 36 races. Some 
drivers participated in all of the races while others participated in only one. We propose to apply 
the Placke tt-Luce model with gamma prior on the parameters. The NASCAR datasetH has been 



studied by [Hunter] ((2004), who noted that the MLE cannot be found for the original data set as 
four drivers placed last in each race they entered, and therefore had to be removed. This does not 
need to be done if we follow a Bayesian approach. We focus here on predicting the outcome of the 
next race based on the previous ones, starting from race 5; i.e. we predict the results of race 6 based 
on the MAP estimates obtained with the first 5 races, then the results of race 7 based on the MAP 
estimated obtained with the first 6 races, etc. For each race, we compute the test log-likelihood 
using the MAP estimates. The mean value and 90% confidence bounds are represented in Figure [2] 
w.r.t. the value of a. The EM algorithm was initialized using (Aj , . . . , Agg' 1 ) = (g^, . . . , g^). 

The Gibbs sampler was also applied to the same dataset. The skill parameters were initialized at 
the same value, and the parameter a was assigned a flat improper prior and sampled as described 
in section We ran 50,000 iterations with 2,000 burn-in. As detailed in Section only the 
normalized weights 7Tj are likelihood identifiable. Skill ratings are usually represented on the real 



2 The data can be downloaded from http://www.stat.psu.edu/ dhunter/code/btmatlab/ 
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Figure 2: Test log-likelihood on the Nascar 2002 dataset. From race 5 to 35, we compute the log- 
likelihood of the next race based on the MAP estimates. Mean and 90% interval of the log-likelihood 
is represented w.r.t. to the parameter a. The straight line represents the test log-likelihood obtained 
with a uniform prior. 



line, and we use the following one-to-one mapping = log7Tj — log 1/83. The marginal posterior 
densities of the reparameterized skill ratings for the first four drivers according to their average 
place are reported in Figure [3j The Bayesian approach can effectively capture the uncertainty in 
the skill ratings of the drivers. ML and MMSE (minimum mean squared error) estimates together 
with standard deviations are reported in Table [1] for the first ten and last ten drivers according to 
average place. 




Figure 3: Marginal posterior distribution for the modified skill ratings of the first 4 drivers 
according to their average place. P. Jones and S. Pruett only participated in 1 race each, while M. 
Martin and T. Stewart participated in 36 races. 
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Table 1: Top ten and bottom ten drivers according to average place, along with ML and MMSE 
estimates of the skill parameters in f3 space. Standard deviations are also provided. 







Average 


MLE 


MMSE 


Standard 


Driver 


Races 


place 


estimate 


estimate 


Deviation 


P. Jones 


1 


4.00 


2.74 


0.11 


0.48 


S. Pruett 


1 


6.00 


2.21 


0.10 


0.48 


M. Martin 


36 


12.17 


0.67 


0.79 


0.17 


T. Stewart 


36 


12.61 


0.42 


0.60 


0.17 


R. Wallace 


36 


13.17 


0.65 


0.78 


0.17 


J. Johnson 


36 


13.50 


0.53 


0.68 


0.17 


S. Marlin 


29 


13.86 


0.33 


0.49 


0.19 


M. Bliss 


1 


14.00 


0.82 


0.04 


0.48 


J. Gordon 


36 


14.06 


0.33 


0.53 


0.17 


K. Busch 


36 


14.06 


0.24 


0.46 


0.17 


C. Long 


2 


40.50 


-1.73 


-0.67 


0.46 


C. Fittipaldi 


1 


41.00 


-1.85 


-0.51 


0.50 


H. Fukuyama 


2 


41.00 


-2.17 


-0.81 


0.50 


J. Small 


1 


41.00 


-1.94 


-0.60 


0.51 


M. Shepherd 


5 


41.20 


-1.86 


-1.05 


0.39 


K. Shelmerdine 


2 


41.50 


-1.73 


-0.72 


0.46 


A. Cameron 


1 


42.00 


-1.41 


-0.44 


0.49 


D. Marcis 


1 


42.00 


-1.38 


-0.43 


0.49 


D. Trickle 


3 


42.00 


-1.72 


-0.87 


0.42 


J. Varde 


1 


42.00 


-1.55 


-0.48 


0.50 



6.3 Chess data 

Rating the skills of chess players is of major practical interest. It allows organizers of a tournament 
to avoid having strong players playing against each other at early stages, or to restrict the tour- 
nament to players with skills above a given threshold. The international chess fede ration adopted 



the so-called "Elo" system which is based on the Bradley- Terry model (Elo, 197 8J). For his torical 



considerations about the rating system in chess, the reader should refer to iGlickmanl ( 19951 ). 



We consider here game-by-game chess results over 100 months, consisting of 65,053 matches 
between 8631 playercL The outcome of each game is either win (+1), tie (+0.5) or loss (0). We 
estimate the parameters of the Bradley- Terry model with ties presented in section 13.21 on the first 
95 months and then predict the outcome of the games of the last 5 months. The hyperparameters 
for the tie parameter 9 are set to ag = 1, bg = 0. Given the large sample size, it is not possible to 
sample from Eq. fjlOf) as the number of elements in the mixture is very large. We therefore use a M- 
H step with a normal random walk proposal of standard deviation 0.1. The mean squared error is 
reported for predictions based on MAP estimates and full Bayesian predictive based on the Gibbs 
sampler outcomes, for different values of the hyperparameter a. EM and Gibbs samplers were 
initialized at (Aj° , . . . , Agg^) = (gg^j-, • • • , gggx) an d = 1) 5. The Gibbs samplers were run with 



Chess data can be downloaded from http://kaggle.com/chess 
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10,000 iterations and 1,000 burn-in iterations. The results are reported in Figure HI The results 
demonstrate the benefit of penalizing the skill rating parameters and the improvement brought up 
by a full Bayesian analysis. In Figure [5] we also report the autocorrelation function associated to 
the parameter 6 and the skill parameters with largest mean values. The Markov chain displays 
good mixing properties. 

395 



0.39 
0.385 

lil 

w 

^ 0.38 

a> 
w 

W 0.375 

h- 

0.37 
0.365 
0.36 

10° 10 1 10* 10 3 

a 




Figure 4: Test mean square error on the chess dataset for different values of the parameter a. 
Based on an history of 95 months, we predict the outcome of the games of the last 5 months. 



7 Conclusion 

The Bradley- Terry model and its generalizations a rise in numero us applications. We have shown 
here that most of the MM algorithms proposed in Hunter ( 2004I ) can be reinterpreted as special 



cases of EM algorithms. Additionally we have proposed original EM algorithms for some recent 
generalizations of the Bradley- Terry models. Finally we have shown how the latent variables 
introduced to derive these EM algorithms lead straightforwardly to Gibbs sampling algorithms. 
These elegant MCMC algorithms mix experimentally well and outperform a recently proposed 
M-H algorithm. 
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